I mean—I hear what you’re saying, and I don’t agree. It’s not that important of a post, so it’s fine, but I think it doesn’t really hook people.
I don’t think the title “We Added Typos to a Benchmark. Then Haiku Saturated It.” communicates what the post is about.
I don’t think it’s about saturation at all. As evidence of this, I ran ctrl-F for “satur” and there are zero mentions of it in the body.
Less critically, “we added typos to a benchmark” is the action rather than the reason, which leaves the reader asking “why is this relevant?”
Got it. I was indeed thinking of framing this blog post more lightheartedly. Would this be better: “We added typos to a benchmark, then Haiku’s scores jumped”? That way there is no mention of “saturation”. I’m thinking the blog post is more “why did Haiku’s score jump” than “LLMs are robust to typos”.
This is true in this context, but it’s worth saying that these models seem to be trained to do well on these benchmarks, so it’s not entirely true that the scores are lower bounds.
The size of the “lower bound” effect is hard to comment on, but if you can provide some input on how much improvement you think comes from tuning your harness, that is a meaningful conclusion. That could be what we build the blog around.
> When Anthropic dropped Opus 4.6, we asked it to figure it out. From our eval logs, Opus 4.6 observed that for a single BigCodeBench prompt, Haiku and Opus often generated multiple code blocks. Furthermore, as typo rate increases, Haiku shifts its behavior for ~20% of its responses from generating multiple code blocks to generating just a single code block.
I also have to say, I don’t know what you mean by “code block”. Does this mean a response?
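For what it’s worth, if “code block” means fenced markdown blocks within a single response (my assumption; the draft should say so explicitly), a minimal counting sketch would look like:

```python
import re

def count_code_blocks(response: str) -> int:
    """Count fenced markdown code blocks (``` ... ```) in one model response."""
    # Non-greedy match so back-to-back fences aren't merged into one block.
    return len(re.findall(r"```.*?```", response, flags=re.DOTALL))

# Hypothetical response: two fenced blocks, so the count is 2.
resp = "Here you go:\n```python\nx = 1\n```\nAnd a test:\n```python\nassert x == 1\n```"
print(count_code_blocks(resp))  # 2
```

Pinning down a definition like this in the post would also make the “~20% shift” claim reproducible.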
> When Anthropic dropped Opus 4.6, we asked it to figure it out.
I think this is a bit informal.
Also it’s odd to say this, because it sounds like you’re saying “this isn’t what we think” or “if it’s wrong, judge it, not us”. You have to claim responsibility for any findings that AI generates.
> We then tested whether this “impossible typo effect” holds for Haiku and Opus on other benchmarks. We chose BBH and GPQA since Haiku struggles reasonably without introducing typos. Here, we no longer observed the impossible typo effect, and Haiku’s capabilities decreased with typo rates.
This should be one figure, not two.
And I think this could be combined with the previous section as well.
> We then tested if other small models have this “impossible typo effect”. We found that, unlike Haiku, the capabilities of GPT-4.1-mini slightly decreased as typos increased.
This plot should have multiple models on the same plot; otherwise the section header is somewhat confusing.
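To make that concrete, a minimal matplotlib sketch of the combined plot. The model names and numbers here are placeholders to illustrate the layout, not real results:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for script use
import matplotlib.pyplot as plt

# Placeholder data: accuracy vs. typo rate per model (illustrative only).
typo_rates = [0.0, 0.1, 0.2, 0.3, 0.4]
accuracies = {
    "Haiku": [0.30, 0.33, 0.36, 0.38, 0.41],          # hypothetical upward trend
    "GPT-4.1-mini": [0.35, 0.34, 0.33, 0.31, 0.30],   # hypothetical slight decline
}

fig, ax = plt.subplots()
for model, acc in accuracies.items():
    ax.plot(typo_rates, acc, marker="o", label=model)
ax.set_xlabel("Typo rate")
ax.set_ylabel("Accuracy")
ax.set_title("Accuracy vs. typo rate, all models on one axis")
ax.legend()
fig.savefig("typo_rate_models.png")
```

One axis with a legend makes the Haiku-vs-others contrast the first thing a reader sees, which is the whole point of that section.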
> We were curious if LLMs are robust under various typo rates.
Hmm, a couple of thoughts. I don’t think my write-up is perfect, but how about something more like this:
> We were curious if LLMs produce the same response when there are typos in the prompt. To test this, we injected typos into the prompts from BigCodeBench and ran different Claude models. We found that while the accuracy of Opus gradually declined with typo rate, Haiku’s accuracy actually increased as the typo rate increased. This blog investigates this counter-intuitive phenomenon.
(Points I care about:
1. “Robust under various typo rates” doesn’t sound like a linear increase in typos, which is what we actually do.
2. “We double-checked our code and asked Claude to do so too, but we couldn’t find any bugs”: this should be assumed for all your work.
3. “The mystery begins.” This format does sound quite AI-generated.)
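Since point 1 hinges on what “a linear increase in typos” actually means, the post might benefit from a concrete snippet. Here is one plausible sketch of rate-controlled typo injection (adjacent-character swaps; this is my assumption about the method, not necessarily what the eval code does):

```python
import random

def inject_typos(prompt: str, typo_rate: float, seed: int = 0) -> str:
    """Corrupt roughly `typo_rate` of alphabetic characters by swapping each
    selected character with its right neighbor (a simple transposition typo)."""
    rng = random.Random(seed)  # seeded so each prompt corrupts reproducibly
    chars = list(prompt)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and rng.random() < typo_rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

# Sweeping typo_rate from 0.0 upward gives the "linear increase" x-axis.
print(inject_typos("Write a function that sorts a list.", 0.3, seed=1))
```

Showing something like this in the post would let readers see exactly what a typo rate of, say, 0.3 does to a prompt.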
I will say, now that I’ve fully read this, I don’t fully get why you prefer this title.
It seems like we should say something about (1) what we are measuring or (2) our conclusion. This title leaves me unsure what this post will be about, except for the pace of benchmark saturation, which isn’t even the point of the blog.
I just think it’s a good hook, taking a leaf from how news pieces are written
I would not say this
We should link to the benchmarks; I’m actually not immediately clear what BBH is (though I get it upon further inspection).
I don’t think this is a term you’ve introduced before
I really am fairly opposed to sentences like this
Suggest “observed result in humans” (or cite something that isn’t a BBC article).
It isn’t clear what “squint” means; I think maybe “think” is better.
I think we should move the image to be right after this intro paragraph.